Abstract. This paper proposes a few lower bounds for the communication complexity of the Gaussian
elimination algorithm on multiprocessors. Three types of architectures are considered: a bus
architecture, a nearest neighbor ring network and a nearest neighbor grid network. It is shown that
for the bus and the ring architectures, the minimum communication time is of the order of $O(N^2)$,
independent of the number of processors, while for the grid it is reduced to
$O(k^{-1/2}N^2) + O(k^{1/2}N)$, where $k$ is the total number of processors. The practical
implications of these bounds are discussed.

1. Introduction

The purpose of this paper is to analyse the effects of data movement delays in parallel Gaussian
elimination algorithms. In the early research publications on parallel algorithms the communication
times were often not accounted for [4], [10] and a parallel algorithm was judged satisfactory whenever
the so-called speed-up for the arithmetic was sufficiently close to the number of processors used.
It soon became clear that such analyses could be overly simplistic and misleading. Gentleman [3]
was probably the first to warn against such abuses by establishing that in parallel computation,
communication complexity may be more limiting than arithmetic complexity. Most of the recent
analyses of parallel algorithms use timing models that take into account communication times in
some form, see e.g. [2, 5, 7, 8, 9].

This paper focuses on the complexity of the Gaussian elimination algorithm, with the viewpoint
that the time for transmitting one data item is not negligible relative to the time to perform
an arithmetic operation. In [5], several simple parallel Gaussian elimination algorithms have been
examined with a systematic incorporation of communication costs in the overall complexity analysis.
The architecture considered there consists of a ring of a relatively small number of processors, with
nearest neighbor connection and a bus interconnection. It was shown that when the number of
processors is small relative to the problem size, the communication time is a low order term
compared with the time to perform arithmetic.

In the present paper we establish further theoretical results concerning the communication
complexity of Gaussian elimination on multiprocessors, emphasizing the case when the number of
processors is large relative to the size of the problem. In particular, we will attempt to clarify the
effect of the architecture on the communication complexity, by contrasting the effectiveness
of three simple configurations, namely a bus, a ring, and a 2-D grid array.

One of our main results is that for the bus and the ring topologies, reducing an $N \times N$ system
into upper triangular form by the Gaussian elimination algorithm requires a communication time of
at least $O(N^2)$, no matter how many processors are used. This can be considered as a generalization
of a result by Gentleman [3]. As a consequence, even though arithmetic speed-ups as high as $N^2$
can be achieved, thus leading to a time linear in $N$ for performing the arithmetic operations, the
communication time will never be reduced below the above lower bound of $O(N^2)$.

One reason for this limitation is intuitively clear: besides the "arithmetic" parallelism used
to speed up arithmetic, one also needs "communication" parallelism to speed up communication.
Grid arrays are one way of achieving communication parallelism, since several paths can be used
simultaneously. However, as the number of processors increases, a second limitation appears: the
distance that a data item must travel will increase on the average, thus leading to higher latency
times. As a result, it will be seen that in 2-D architectures the lower bound for communication time
becomes of the order of $O(k^{-1/2}N^2) + O(k^{1/2}N)$. An important consequence is that the Gaussian
elimination algorithm cannot be achieved in linear time on a two-dimensional grid array no matter
how many processors are used. In fact the best time that can be achieved is a disappointing
$O(N^{5/3})$.

In  Section  2  we  describe  our  three  models  and  propose  two  different  general  techniques  for 
establishing  lower  bounds  for  communication  times.  In  Sections  3, 4  and  5  we  propose  lower  bounds 
for  communication  times  for  Gaussian  elimination  on  the  bus,  the  ring  and  the  grid  networks 
respectively.  In  Section  6,  the  bounds  for  the  grid  are  illustrated  on  a  simple  example  and  in 
Section  7  a  tentative  conclusion  is  drawn. 
2. Models and basic properties


2.1.  Description  of  the  models 

We consider a multiprocessor system consisting of $k$ identical processors, each with its own
memory sufficiently large to hold its share of the data. We assume that there is no global memory
shared by the processors, or if there is such a memory that it is not used for passing data between
processors. Unless otherwise stated, the number of processors does not exceed $N^2$. The linear
system to solve is $Ax = f$, where $A$ is a real $N \times N$ matrix. Throughout the paper it is assumed
that

• each of the $k$ processors holds $N^2/k$ elements of the matrix $A$ (equidistribution of the data).

• no element of $A$ belongs to more than one processor (exclusivity assumption).

Since we are seeking lower bounds, the additional communication time related to moving the
right hand side is neglected for simplicity, so we are not concerned with the way in which the right
hand side is distributed among the processors.

There  are  many  different  ways  in  which  the  processors  can  be  interconnected.  In  this  paper 
we  are  interested  in  the  following  three  simple  topologies. 

1. The first and simplest architecture consists of a broadcast bus linking the $k$ processors as
illustrated in Figure 1. In this model, to send a vector of length $m$ from one processor to any
number of processors via the broadcast bus requires a time of

$m\tau_B$,  (2.1)

where $\tau_B$ will be referred to as the elemental transfer time for the bus, and is expressed in
seconds per word.

2. The second architecture consists of a nearest neighbor interconnection ring as shown in Figure 2.
To send a vector of length $m$ from one processor to one of its neighbors via the local links
requires a time of

$m\tau_R$,  (2.2)

where $\tau_R$ is the elemental transfer time of the local links, in seconds per word. Any processor
can send (or receive) a data item from one neighbor while sending (or receiving) another data
item from the other neighbor.

3. The third architecture consists of a grid array of $k$ processors arranged on a square grid with
$\sqrt{k}$ processors on each side, see Figure 3. We will often use the term 2-D array for grid array.
The assumptions are identical with those of the above model but one processor is now able to
simultaneously communicate with (at most) four neighbors. The elemental transfer time for
the grid is denoted by $\tau_G$.

Ipsen, Saad and Schultz [5] utilize a more general formula for estimating transfer times, in
which a (constant) start-up time $\beta$ is added to the above times. Since we are seeking lower
bounds for communication, we feel that taking $\beta = 0$ is sufficiently representative, though we
acknowledge that start-up times may very well dominate in machines with large $\beta$.

An  important  assumption  we  will  make  for  all  three  models  is  that  arithmetic  cannot  be 
overlapped  with  communication  in  any  processor.  More  specifically,  the  algorithm  considered  takes 
on  the  following  general  form. 

For j = 1, 2, ..., N - 1 do:

(a) Data Movement: Read the pivot information (i.e. the jth row and the pivots) needed by
any processor to perform its part of the elimination.

(b) Arithmetic: Perform the jth step of Gaussian elimination.
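The two-subtask structure of each step can be illustrated with a minimal single-process sketch in Python. It is not the parallel algorithm itself: the function name `gaussian_elimination_two_phase` and the counter `transfers` are hypothetical, and the sketch only separates what a processor would have to read (the pivot row and the multipliers, called "pivots" in the text) from the arithmetic that follows.

```python
def gaussian_elimination_two_phase(A):
    """Reduce a square matrix (list of lists) to upper triangular form, without pivoting.

    Returns the reduced matrix and a count of the words read in the
    data-movement subtasks (an illustrative counter, not a timing model).
    """
    n = len(A)
    U = [row[:] for row in A]              # work on a copy
    transfers = 0
    for j in range(n - 1):
        # (a) Data movement: obtain the pivot row and the multipliers for step j.
        pivot_row = U[j][j:]               # the part of row j still involved
        multipliers = [U[i][j] / U[j][j] for i in range(j + 1, n)]
        transfers += len(pivot_row) + len(multipliers)
        # (b) Arithmetic: update the trailing rows; no data is read here.
        for offset, i in enumerate(range(j + 1, n)):
            m = multipliers[offset]
            for l in range(j, n):
                U[i][l] -= m * pivot_row[l - j]
    return U, transfers
```

In a real multiprocessor the words counted in subtask (a) would travel over the bus or the local links; the bounds derived in this paper concern how long that movement must take.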


Figure  1:  Broadcast  bus  architecture. 


Each  task  (step  j)  of  the  algorithm  consists  of  a  subtask  of  data  movements  whereby  the 
processors  get  the  pivot  information  needed  to  perform  the  eliminations  and  then  a  subtask  of 
arithmetic  computations,  and  these  two  subtasks  are  not  overlapped.  A  similar  assumption  was 
made by Gannon and Van-Rosendale [2]. Thus one can naturally speak of a communication time
(the  total  time  for  communication  subtasks)  and  an  arithmetic  time  (the  total  time  for  arithmetic 
subtasks).  More  generally,  the  communication  time  can  be  defined  as  the  execution  time  of  the 
given  algorithm  under  the  assumption  that  the  times  for  performing  arithmetic  are  neglected.  The 
arithmetic  time  is  defined  in  a  similar  manner,  i.e.  as  the  execution  time  when  transfer  times  are 
neglected.  These  definitions,  which  are  consistent  with  those  defined  above  for  the  non-overlapping 
case,  have  also  been  used  by  Gannon  and  Van-Rosendale  [2]. 

The reason for the non-overlapping assumption is essentially pedagogical and can be justified as
follows. Models where arithmetic and communication can be overlapped are more complicated to
analyze. Yet observe that the execution time of an algorithm in an environment where overlapping
is assumed is within a factor of only two of that of the same algorithm run in an environment
where no overlapping is assumed. Indeed, consider any algorithm and let $t_A$ be the total time to
perform arithmetic and $t_T$ be the total time required for data transfers. In the overlapping case
the total execution time $t_E$ will be at least $\max\{t_A, t_T\}$, which represents the time with maximum
overlapping. In the non-overlapping case it will be simply $t_A + t_T$, which satisfies

$\frac{1}{2}(t_A + t_T) \le \max\{t_A, t_T\} \le t_E \le t_A + t_T.$

2.2.  Two  types  of  lower  bounds  for  communication  times 

We start by considering a data transfer operation which is essential in Gaussian elimination
and which consists in sending a vector of $m$ elements from one processor to all others or to a few
neighboring processors. Using the broadcast bus this type of transfer operation will consume
a time of $m\tau_B$ independent of the number of processors to be reached.

For Models 2 and 3, an efficient way of achieving this type of data movement is to pipeline
the data as follows. Assume that the data is to be moved from processor $P_1$ to the sequence of
$i$ consecutive processors $P_2, P_3, \ldots, P_{i+1}$. Then in step 1 the first element is sent from $P_1$ to $P_2$. In
step 2, while sending the first element to $P_3$, $P_2$ receives the second element from $P_1$. Generally, at
step $j$, the first element reaches $P_{j+1}$ from $P_j$ while the second element follows from $P_{j-1}$ to $P_j$, etc.
The first element reaches $P_{i+1}$ at the $i$th step and $m - 1$ more steps are needed for the remaining
$m - 1$ elements to enter $P_{i+1}$, resulting in a total of $(m - 1) + i$ steps. Therefore, transferring a
vector of length $m$ to $i$ consecutive processors takes the time

$t(m, i) = (m - 1 + i)\tau$,  (2.3)

where $\tau$ stands for either of $\tau_R$ (ring) or $\tau_G$ (grid). Note that because of the overlapping of data
transfers, the above time represents also the time needed for sending a vector of length $m$ from one
processor to all those that are distant by $i$ local links from it. For the second and third model, the
distance between two processors is the minimum number of local links to cross in order to reach
one processor from the other.
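The step count behind (2.3) can be checked with a small simulation. The sketch below is an illustration under simple assumptions: each local link carries at most one word per step, all links work concurrently, and a processor may forward one word while receiving another. The function name `pipelined_transfer_steps` is hypothetical.

```python
def pipelined_transfer_steps(m, i):
    """Count the steps needed to move m >= 1 words from P1 down a chain to P_{i+1}."""
    position = [0] * m                 # position[e] = processor index holding word e (0 is P1)
    steps = 0
    while position[-1] < i:            # stop when the last word has arrived
        steps += 1
        link_busy = [False] * i        # link q joins processors q and q+1
        for e in range(m):             # words are served in their original order
            q = position[e]
            if q < i and not link_busy[q]:
                link_busy[q] = True    # one word per link per step
                position[e] = q + 1
    return steps
```

For example, `pipelined_transfer_steps(3, 2)` returns 4, which is $(m - 1) + i$ with $m = 3$ and $i = 2$.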

It is important to distinguish between the two times $(m - 1)\tau$ and $i\tau$ in (2.3). The first term
$(m - 1)\tau$ reflects the actual speed of the local links: the higher the bandwidth of the local links,
the faster the data transfer, given the same amount of data. The second time is what might be
termed latency time, as it corresponds to the time required to fill all $i$ links. The latency time also
represents the longest distance that any data item must travel, and is to be compared with the start-up
times of pipelined arithmetic units.

We now outline how we establish our lower bounds for communication times. For the broadcast
bus, simply observe the important fact that sending a vector of length $m$ to any number of processors
contributes a time of $m\tau_B$ to the total communication time. Thus, in order to derive a lower bound
for the communication time for the bus network we only need to count the number of occurrences
of elemental data transfers from one processor to at least another one.

For  the  nearest  neighbor  interconnections  (Models  2  and  3),  it  is  not  as  simple  because  transfers 
in  several  edges  of  the  array  can  take  place  simultaneously.  Also  the  distance  which  a  data  item 
travels,  i.e.  the  number  of  edges  that  it  crosses,  is  now  important. 

One way to handle these difficulties is to count the total number $n$ of elementary transfers from
any one processor to a neighbor. In other words $n$ represents the total over all edges of the number
of times that these edges are crossed. Since there are only $k$ edges for the ring and $2(k - \sqrt{k})$
edges for the square grid, a lower bound for communication times will be given by $(n/k)\tau_R$ for the ring
and $[n/(2k - 2\sqrt{k})]\tau_G$ for the grid. We will refer to these lower bounds as the "bandwidth lower
bounds" as they take into consideration the total bandwidth of the system.

As an example consider the application of this reasoning to the problem of sending a vector of
length $m$ from $P_1$ to all processors in a $k$-processor ring. Each of the $m$ elements must cross $k - 1$
edges, so the total of elementary transfers is $n = m(k - 1)$. From the above argument, it will take
a minimum of $(n/k)\tau_R = m(1 - 1/k)\tau_R$ to achieve the data transfer. This bandwidth lower bound is
nothing but the best possible time in which the $n$ elementary transfers could be realized, if we could
keep all $k$ edges saturated during the whole data transfer. However, notice that as was explained
earlier this saturation state will not be reached before a start-up time of $(k - 1)\tau_R$. Therefore, the
lower bounds obtained by this technique do not account for the latency time, which is $(k - 1)\tau_R$
in this example.

This leads us naturally to a second mechanism for obtaining lower bounds for transfer times,
which consists in simply considering the latency time as a lower bound for the total communication
time. In the above example this would lead to the lower bound $(k - 1)\tau_R$, which may indeed be
sharper than the above bandwidth lower bound when $k$ is large with respect to $m$. This second
type of lower bound will be referred to as the "latency bound" for obvious reasons. Latency
lower bounds are likely to be sharper if the number of processors is large with respect to the data
length, while the bandwidth bounds will be sharper if the number of processors is small. As will
be seen, a good lower bound should take into account both types of bounds. Gentleman [3] uses
a model where $k = N^2$, i.e. each processor holds one element of the matrix, and his lower bounds
are latency type lower bounds.
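The two kinds of bounds for the ring broadcast example can be compared numerically. The following Python sketch simply evaluates the two expressions from the example, $m(1 - 1/k)\tau_R$ and $(k - 1)\tau_R$; the function name `ring_broadcast_bounds` and the default value of `tau_R` are illustrative assumptions.

```python
def ring_broadcast_bounds(m, k, tau_R=1.0):
    """Bandwidth and latency lower bounds for sending m words from P1 to a whole k-ring."""
    n = m * (k - 1)                    # total number of elementary transfers
    bandwidth = n / k * tau_R          # all k links kept saturated throughout
    latency = (k - 1) * tau_R          # time before the farthest link can be reached
    return bandwidth, latency
```

With a long vector and few processors (`m = 100`, `k = 4`) the bandwidth bound is the larger of the two; with a short vector and many processors (`m = 4`, `k = 100`) the latency bound takes over, as described above.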
Figure 4: The jth step of Gaussian elimination.

3.  Communication  complexity  for  the  broadcast  bus  architecture 

In this section, a lower bound for the communication time is derived for the Gaussian elimination
algorithm without pivoting implemented on a broadcast bus network. Recall that the
fundamental assumptions are that the matrix $A$ is equidistributed among the $k$ processors, and
that no data is to be in more than one processor (exclusivity assumption). Note that the matrix
$A$ can be assigned in any possible fashion provided each processor holds $N^2/k$ elements.

As was already stated in Section 2, moving a group of $m$ elements to any number of processors
will add a time of at least $m\tau_B$ to the total communication time. We can therefore use an
element by element approach to get the desired lower bound for the communication complexity: the
communication time is at least $\tau_B$ times the number of elements that must be moved to at least
one other processor during the algorithm. The following lemma establishes a lower bound for this
number.

Lemma 3.1. During the Gaussian elimination algorithm the number $n_B$ of matrix elements which
must be moved from one processor to at least one other processor satisfies:

$n_B \ge \frac{N(N+1)}{2} - \frac{h(h+1)}{2},$

where

$h = \lfloor 2N/\sqrt{3k} \rfloor.$


Proof. Consider the jth step of the algorithm as depicted in Figure 4. We assume that $j \le N - h$
for reasons explained later. Any mention of row (or column) in this proof will refer to the part of
the row (or column) of $A$ consisting of its last $(N - j + 1)$ elements.

Any element $a_{jl}$, $l = j, j+1, \ldots, N$ must be known to all processors that hold at least one element
of the column $l$ below $a_{jl}$. Similarly, any given element $a_{ij}$, $i = j+1, \ldots, N$ must be known to the
processors that hold at least one element of row $i$, at the right of $a_{ij}$, in order that these processors
be able to compute the pivots.

We would like to show that at least $M = N - j + 1$ elements among those elements of the
jth row and column must be moved at step j. Assume that $a$ elements of row j are not moved
anywhere. If $a = 0$ there is nothing to prove, so assume $a \ne 0$. If an element $a_{jl}$ of the row is not
moved, this means that the whole $l$th column below that element belongs to the same processor.

We have to show that in this situation at least $a$ elements must be moved from column j to
the right. Assume that this were not the case, i.e. that at most $a - 1$ of the $M - 1$ elements
$a_{ij}$, $i = j+1, \ldots, N$ are moved or equivalently that at least $M - a$ elements on that column are
not moved. For each of these elements, the corresponding row, once again, belongs entirely to one
processor, say $P_1$. We claim that all of the $a$ "unmoved" columns and the $M - a$ "unmoved" rows
must belong to this same processor $P_1$. This follows easily from the exclusivity assumption: if
the whole of row $i$ belongs to one processor, say $P_1$, while the whole of column $l$ belongs to one
processor, say $P_2$, then $P_1 = P_2$, because $a_{il}$ is a common element to $P_1$ and $P_2$.

Now $P_1$ contains at least

$(M - a)^2 + Ma \ge \frac{3}{4}M^2$

elements. This would contradict assumption 1 whenever $\frac{3}{4}(N - j + 1)^2 > N^2/k$, which is the case
because for $j \le N - h$ we have

$N - j + 1 \ge \lfloor 2N/\sqrt{3k} \rfloor + 1 > 2N/\sqrt{3k}.$

This explains our initial restriction on j.

We have therefore proved that for each step j such that $j \le N - h$, at least $N - j + 1$
elements must be moved. Summing up for $j = 1, 2, \ldots, N - h$ we get the desired lower bound for the
number $n_B$ of elements which are moved from one processor to at least one other processor during
the algorithm:

$\sum_{j=1}^{N-h} (N - j + 1) = \sum_{i=h+1}^{N} i = \sum_{i=1}^{N} i - \sum_{i=1}^{h} i = \frac{1}{2}N(N+1) - \frac{1}{2}h(h+1).$


As  a  consequence  of  the  above  lemma  we  obtain  the  following  theorem. 

Theorem 3.1. If $k > 1$ then the total communication time $t_B$ required to perform the Gaussian
elimination algorithm on a broadcast bus architecture satisfies:

$t_B \ge \left(\frac{N^2}{2} - \frac{2N^2}{3k}\right)\tau_B.$

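Proof. The function $x \mapsto x(x+1)/2$ is an increasing function of the variable $x$ when $x \ge 0$ and
therefore

$h(h+1)/2 \le \frac{1}{2}\left(\frac{2N}{\sqrt{3k}}\right)\left(\frac{2N}{\sqrt{3k}} + 1\right) = \frac{2N^2}{3k} + \frac{N}{\sqrt{3k}}.$

Substituting this into Lemma 3.1 gives $n_B \ge \frac{N^2}{2} - \frac{2N^2}{3k} + N\left(\frac{1}{2} - \frac{1}{\sqrt{3k}}\right)$.
When $k > 1$, the term between parentheses in the right hand side of the above expression is positive.
This, with the observation made before Lemma 3.1, completes the proof.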


The theorem shows that with a broadcast bus interconnection network, and for $k \ge 2$, the
execution time for Gaussian elimination will be at least quadratic in $N$, no matter how many
processors are used. The minimum of the above bound is reached for $k = 2$ and yields $t_B \ge \frac{1}{6}N^2\tau_B$.
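The bus bounds can be evaluated numerically with a short Python sketch. It reads Lemma 3.1 as $n_B \ge N(N+1)/2 - h(h+1)/2$ with $h = \lfloor 2N/\sqrt{3k} \rfloor$ and Theorem 3.1 as $t_B \ge (N^2/2 - 2N^2/(3k))\tau_B$; these readings, and the function names, are assumptions of the sketch rather than statements taken verbatim from the source.

```python
import math

def bus_transfer_count_bound(N, k):
    """Lower bound on the number of moved elements, as read from Lemma 3.1; also returns h."""
    h = math.floor(2 * N / math.sqrt(3 * k))
    return N * (N + 1) // 2 - h * (h + 1) // 2, h

def bus_time_coefficient(N, k):
    """Factor multiplying tau_B in the bound read from Theorem 3.1."""
    return N * N / 2 - 2 * N * N / (3 * k)
```

For `N = 100` and `k = 2` the first function gives 1729 moved elements (with `h = 81`), while the second gives $N^2/6 \approx 1666.7$, the value quoted for the least favourable case $k = 2$.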


4.  Communication  complexity  for  the  multiprocessor  ring 

Following the first method suggested in Section 2.2 for establishing lower bounds on communication
complexity, we start by establishing a lower bound for the total number of elementary
transfers, i.e. the overall total number of movements of any one element from one processor to a
neighbor. The following inequality, which is easy to prove by considering the integral of the function
$x^\nu$ in the interval $[0, N]$, will be quite useful in both this section and the next one:

$\forall \nu > 0, \quad \sum_{j=1}^{N} j^\nu \ge \frac{N^{\nu+1}}{\nu+1}.$  (4.1)


Lemma 4.1. The minimum number of elementary data transfers required to perform the Gaussian
elimination algorithm on a nearest neighbor k-processor ring satisfies:

$n_R \ge \frac{1}{16}N^2 k - \frac{1}{2}N^2.$

Proof. Consider the jth step of the algorithm as illustrated in Figure 5. Each element $a_{jl}$,
$l = j, j+1, \ldots, N$ must be made available to all processors containing at least one element of the $l$th
column. (We ignore the right hand side $f$.) Let $\beta_l$ be the minimum number of edges that must be
crossed by $a_{jl}$ in order to achieve this. We refer to the corresponding path as the north-south path
of $a_{jl}$. Similarly, let $\gamma_i$, $i = j+1, j+2, \ldots, N$, be the length of the path, thereafter referred to as the
west-east path of $a_{ij}$, which $a_{ij}$ must cover in order to become available in each processor holding
any part of the $i$th row. We would like to find a lower bound for the sum

$n_j = \sum_{i=j+1}^{N} \gamma_i + \sum_{l=j}^{N} \beta_l.$  (4.2)

Observe that if $M = (N - j + 1)$ and $s = N^2/k$ then the $M \times M$ matrix $\{a_{il},\ l = j, \ldots, N;\ i = j, \ldots, N\}$
is contained in a total of at least $k_1 = \lceil M^2/s \rceil$ different processors. Consider first the element
$a_{jj}$ ($l = j$), which is in some processor, say $P_1$, and let $P_p$ be the processor among the above group
of $k_1$ processors which is the farthest away from $P_1$.

For the ring it is easily seen that given an arbitrary group of $p$ processors, any member of this
group admits at least one processor which is at a distance no less than

$\delta_R(p) = \lceil (p - 1)/2 \rceil.$  (4.3)

The function $\delta_R$ is the reciprocal of the function a defined by Gentleman [3].
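The claim behind (4.3) can be verified exhaustively on small rings. The Python sketch below reads (4.3) as $\delta_R(p) = \lceil (p - 1)/2 \rceil$ and checks, for every group of $p$ processors on a $k$-ring and every member of the group, that some member of the group lies at least that far away. The helper names are hypothetical.

```python
from itertools import combinations
from math import ceil

def ring_distance(a, b, k):
    """Minimum number of links between processors a and b on a k-processor ring."""
    d = abs(a - b)
    return min(d, k - d)

def check_delta_R(k):
    """Return True if ceil((p - 1) / 2) is a valid farthest-distance bound for all groups."""
    for p in range(1, k + 1):
        for group in combinations(range(k), p):
            for member in group:
                farthest = max(ring_distance(member, other, k) for other in group)
                if farthest < ceil((p - 1) / 2):
                    return False
    return True
```

The reason the check succeeds is that at most $2d + 1$ processors of a ring lie within distance $d$ of a given one, so a group of $p$ processors forces $d \ge (p - 1)/2$.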

An important observation is that any of the processors of the group of $k_1$ processors can be
reached by $a_{jj}$ by following part of its north-south path and part of the west-east path of some row
$i_1$. As a result it is possible for $a_{jj}$ to reach $P_p$ by crossing at most $\beta_j + \gamma_{i_1}$ edges. Hence

$\beta_j + \gamma_{i_1} \ge \delta_R(k_1).$  (4.4)


Figure 5: The jth step of Gaussian elimination.

Next consider the element $a_{j,j+1}$ ($l = j + 1$) and the $(M - 1) \times M$ matrix obtained by deleting
the $i_1$th row. This matrix is contained in at least $k_2 = \lceil (M - 1)M/s \rceil$ different processors, and by the
same argument as above there exists at least one processor which is at a distance $\delta_R(k_2)$ from that
containing $a_{j,j+1}$. Furthermore, this processor can be reached by $a_{j,j+1}$ by crossing part of its
north-south path and part of the west-east path of some $a_{i_2 j}$, and we have again $\beta_{j+1} + \gamma_{i_2} \ge \delta_R(k_2)$.

This argument can be repeated for $l = j + 2, \ldots, N - 1$, with each time an inequality of the form

$\beta_l + \gamma_{i_h} \ge \delta_R(k_h),$  (4.5)

where we have set $h = l - j + 1$ for convenience and where

$k_h \equiv \lceil (M - h + 1)M/s \rceil.$  (4.6)

Since the rows $i_h$ are deleted as soon as their $\gamma_{i_h}$ are used, each row index $i$ will appear once
and only once, i.e. each $\gamma_i$ in (4.2) and each $\beta_l$, $j \le l \le N - 1$, is accounted for once and only once.
For the term $\beta_N$ in (4.2), which is not yet accounted for, we can use a separate argument. The $N$th
column is contained in at least $k_M = \lceil M/s \rceil$ processors. Note that this is also obtained by letting
$h = M$ in (4.6). Hence the north-south path of $a_{jN}$ satisfies

$\beta_N \ge \delta_R(k_M).$  (4.7)

Adding the inequalities (4.5) for $l = j, \ldots, N - 1$ and (4.7) we get

$n_j = \sum_{l=j}^{N-1} (\beta_l + \gamma_{i_h}) + \beta_N \ge \sum_{h=1}^{M} \delta_R(k_h).$  (4.8)

$n_j \ge \sum_{h=1}^{M} \lceil (k_h - 1)/2 \rceil \ge \sum_{h=1}^{M} \left(\frac{(M - h + 1)M}{s} - 1\right)/2 \ge \frac{1}{4s}M^3 - \frac{1}{2}M.$

(Note that for step $j = N$ there is no data transfer and $n_N = 0$, but the above inequality is valid
because its right hand side is negative.) Summing up the above for $j = 1, \ldots, N$ and using (4.1) we
get

$n_R \ge \frac{N^4}{16s} - \frac{1}{4}N(N+1).$

Replacing $s$ by $N^2/k$ and using the inequality

$\frac{1}{2}N(N+1) \le N^2$  (4.9)

yields the result.


As an immediate consequence we have the following corollary.

Corollary 4.1. The communication time required to perform the Gaussian elimination algorithm
on a multiprocessor ring is such that

$t_R \ge t_R^B \equiv \frac{N^2}{16}\left(1 - \frac{8}{k}\right)\tau_R.$  (4.10)

Proof. The result follows by dividing the lower bound for the total number $n_R$ of elementary data
movements given by Lemma 4.1 by the total number of links available in a ring, which is $k$. ∎

Next  we  derive  a  lower  bound  that  takes  into  account  latency  rather  than  total  bandwidth, 
i.e.  a  latency-type  lower  bound  as  defined  in  Section  2.2. 

Lemma 4.2. The minimum communication time required to perform the Gaussian elimination
algorithm on a nearest neighbor k-processor ring satisfies:

$t_R \ge t_R^L \equiv \frac{1}{12}(k - 3)N\tau_R.$  (4.11)

Proof. In the proof of Lemma 4.1, we have shown that at each step j of Gaussian elimination the
element $a_{jj}$ travels a distance $\beta_j$ while some element $a_{i_1 j}$ travels the distance $\gamma_{i_1}$, satisfying

$\beta_j + \gamma_{i_1} \ge \delta_R(k_1) \ge \lceil (k_1 - 1)/2 \rceil \ge (k_1 - 1)/2 \ge \left(\frac{M^2}{s} - 1\right)/2.$

Hence $\max\{\beta_j, \gamma_{i_1}\} \ge \frac{1}{2}\delta_R(k_1)$ and therefore at each step there is a latency time of at least
$\frac{1}{4}\left(\frac{M^2}{s} - 1\right)\tau_R$. Adding these times for $j = 1, \ldots, N$, using (4.1) and replacing $s$ by $N^2/k$ gives (4.11).

The times $t_R^B$ and $t_R^L$ given by (4.10) and (4.11) are the bandwidth and latency lower bounds
respectively. Notice the important fact that for large $k$, it is the latency times that dominate. Thus
for situations where $k$ is small with respect to $N$, the bandwidth bound is sharper while for large
$k$ the latency bound becomes sharper. A convenient way of taking advantage of both situations is
by taking the maximum of both lower bounds as is done in the following theorem.

Theorem 4.1. The minimum communication time required to perform the Gaussian elimination
algorithm on a k-processor ring satisfies:

$t_R \ge \max\{t_R^B, t_R^L\} \ge \frac{1}{2}\{t_R^B + t_R^L\} = \frac{N^2\tau_R}{32}\left(1 - \frac{8}{k}\right) + \frac{kN\tau_R}{24}\left(1 - \frac{3}{k}\right).$  (4.12)

As an example, when $k = N^2$ the execution time is at least of the order $O(N^3)$. Gentleman's
result [3] corresponds to the case $k = N^2$ and is based on latency times only. His lower bound for
computing the inverse of a matrix is $O(N^2)$ instead of $O(N^3)$, but he does not restrict himself to
Gaussian elimination for computing $A^{-1}$. Note that we do not claim that the above lower bounds
can be reached. In fact the proofs indicate that this is unlikely. We are interested only in the
highest order terms (with respect to $N$ and $k$).
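The interplay of the two ring bounds can be explored with a short Python sketch. It takes the bandwidth bound in the form $\frac{N^2}{16}(1 - 8/k)\tau_R$ and the latency bound in the form $\frac{1}{12}(k - 3)N\tau_R$; the constants 1/16 and 1/12 follow from the derivation in this section as read here and should be treated as assumptions, as should the function name `ring_bounds`.

```python
def ring_bounds(N, k, tau_R=1.0):
    """Return (bandwidth bound, latency bound, their maximum) for a k-processor ring."""
    bandwidth = N * N / 16 * (1 - 8 / k) * tau_R
    latency = (k - 3) * N / 12 * tau_R
    return bandwidth, latency, max(bandwidth, latency)
```

For `N = 100` the bandwidth bound is the larger one at `k = 16`, whereas at `k = N**2 = 10000` the latency bound is close to $N^3/12$ and dominates by far, in line with the cubic growth mentioned above.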

5.  Communication  complexity  for  the  multiprocessor  grid 

In this section, the $k$ processors under consideration are connected in a square grid,
as shown in Figure 3, where $\sqrt{k}$ is an integer. We follow the same steps as in the previous section
and start by showing the following analogue of Lemma 4.1.

Lemma 5.1. The minimum number of elementary data transfers required to perform the Gaussian
elimination algorithm on a nearest neighbor $\sqrt{k} \times \sqrt{k}$ multiprocessor grid satisfies:

nG  >  ^N2Vk  -  A'2.  (5.1) 

Proof. The only difference with the proof of Lemma 4.1 is that the function $\delta_R$ is replaced by a
new function $\delta_G$ related to the new architecture. For a group of $p$ different processors on the grid
$\delta_G(p)$ represents a lower bound for the distance from any member of this group to its most remote
processor in the group. A simple calculation shows that

$\delta_G(p) = \lceil (\sqrt{2p - 1} - 1)/2 \rceil$

but  for  convenience  we  will  use  the  simpler  lower  bound 

(5.2). 
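One way to carry out such a calculation is a counting argument: on a grid, the number of positions within distance $d$ of a given processor is at most $2d^2 + 2d + 1$, so a group of $p$ processors must contain one at distance at least the smallest $d$ with $2d^2 + 2d + 1 \ge p$. The Python sketch below checks this count by enumeration and compares the resulting distance with the closed form $\lceil (\sqrt{2p - 1} - 1)/2 \rceil$. Both the closed form and the function names are assumptions of this sketch, and the simplified bound (5.2) of the text is not reproduced here.

```python
from math import ceil, sqrt

def manhattan_ball_size(d):
    """Number of grid positions within d links of a point on an unbounded square grid."""
    return sum(1 for x in range(-d, d + 1)
                 for y in range(-d, d + 1) if abs(x) + abs(y) <= d)

def delta_G_by_counting(p):
    """Smallest d such that a ball of radius d can contain p processors."""
    d = 0
    while manhattan_ball_size(d) < p:
        d += 1
    return d

def delta_G_closed_form(p):
    """Closed form assumed for the same quantity."""
    return ceil((sqrt(2 * p - 1) - 1) / 2)
```

On a finite grid the balls can only be smaller (they are cut off at the boundary), so the distance obtained this way remains a valid lower bound.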

Resuming the proof of Lemma 4.1 from (4.8), we have

$n_j = \sum_{l=j}^{N-1} (\beta_l + \gamma_{i_h}) + \beta_N \ge \sum_{h=1}^{M} \delta_G(k_h).$  (5.3)

After summation over $j = 1, \ldots, N$ and use of inequalities (4.1) and (4.9) we get the desired result. ∎

In a square $\sqrt{k} \times \sqrt{k}$ grid, there is a total of $2\sqrt{k}(\sqrt{k} - 1) = 2(k - \sqrt{k})$ links. Therefore, the
above lemma leads to the following corollary.

Corollary 5.1. On a multiprocessor grid the communication time for the Gaussian elimination
algorithm satisfies

(5.4)

Proceeding  as  in  the  previous  section,  we  obtain  the  following  latency  lower  bound. 

Lemma 5.2. The communication time for the Gaussian elimination algorithm, implemented on a
$\sqrt{k}\times\sqrt{k}$ multiprocessor grid satisfies


(5.5) 


Proof.  The  proof  is  similar  to  that  of  Lemma  5.1.  We  have  3,  +  > 


-  1  and  therefore 


1  I  2 

max{0j;~nl }  > 

At  each  step  of  Gaussian  elimination  we  therefore  have  a  latency  time  of  at  least  ^(\f^  -  1 ) tq ■ 
The result follows by summing up over $j = 1, \ldots, N$.

∎

Combining  Corollary  5.1  and  the  above  lemma,  the  following  theorem  can  be  derived. 

Theorem 5.1. When $k > 1$, the minimum communication time for the Gaussian elimination algorithm
implemented on a $\sqrt{k}\times\sqrt{k}$ multiprocessor grid satisfies:

to  >  Max<i»0: ifc}  >  i +  fU  >  i  (jjL  +  (5.6) 

Proof. Here we have used the inequalities $\sqrt{k} - 1 < \sqrt{k}$ and $k - \sqrt{k} \ge k/2$ (which
is true when $\sqrt{k} \ge 2$, i.e. when $k > 1$), to simplify the right hand side of (5.4).

∎

The above lower bound starts by decreasing when $\sqrt{k}$ increases from 2 and then increases
again as $k$ becomes large. This shows that it is impossible to achieve ever higher speed-up by using
more processors, even though the arithmetic time might be reduced by factors as high as $N^2$ using
$k = N^2$ processors.

Let $\omega$ be the time to perform a pair of arithmetic operations consisting of an addition and
a multiplication. The arithmetic operations in the Gaussian elimination algorithm will consume
a time of $N^3\omega/(3k)$ at best. Adding this to the lower bound (5.6) we get the minimal time in
which the Gaussian elimination can be executed on a $\sqrt{k}\times\sqrt{k}$ grid. Neglecting the low
order terms the resulting total time is of the form

$$\alpha_1 \frac{N^3}{k}\,\omega + \left(\alpha_2 \frac{N^2}{\sqrt{k}} + \alpha_3 N\sqrt{k}\right)\tau_G, \qquad (5.7)$$

where $\alpha_i$, $i = 1, 2, 3$, are some constants. One can now minimize the above function with
respect to the number of processors $k$. With the correct coefficients $\alpha_i$, $i = 1, 2, 3$, the
resulting expressions are complicated and difficult to interpret. However, in the next section we
will propose an example of implementation of the Gaussian elimination algorithm which leads to a
timing formula of the form (5.7) with simple coefficients. As will be seen, the minimum of the above
function is of the order $O(N^{5/3})$. This is not a big improvement over the optimal time $O(N^2)$
obtained for the ring.

Figure 6: Distribution of the linear system in k = 16 processors.

6.  A  Gaussian  elimination  algorithm  for  the  multiprocessor  grid 

As an example of implementation of the Gaussian elimination algorithm on a square multiprocessor
grid, we consider now the simplest distribution of the data among the $k$ processors, which consists
in mapping the matrix naturally onto the array. The matrix is partitioned into $k$ square blocks of
size $(N/\sqrt{k})\times(N/\sqrt{k})$ each, while the right hand side is partitioned into $\sqrt{k}$
equal parts which are distributed into the processors of the last column of the grid. The situation
is illustrated in Figure 6, where the number in each block indicates the processor to which the
block is assigned.
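
The mapping just described can be sketched in a few lines. This is an illustration only: it assumes 0-based indices, that $\sqrt{k}$ divides $N$, and that processors are identified by their (row, column) position in the grid. The numbering used in Figure 6 is not reproduced here, and the function names are hypothetical.

```python
import math


def block_owner(i, j, N, k):
    """Return the (grid_row, grid_col) of the processor holding entry A[i][j].

    Assumes 0-based indices and that sqrt(k) is an integer dividing N.
    """
    q = math.isqrt(k)
    if q * q != k or N % q != 0:
        raise ValueError("k must be a perfect square with sqrt(k) dividing N")
    m = N // q  # block size m = N / sqrt(k)
    return (i // m, j // m)


def rhs_owner(i, N, k):
    """Return the processor holding entry b[i]: last grid column, same block row."""
    q = math.isqrt(k)
    if q * q != k or N % q != 0:
        raise ValueError("k must be a perfect square with sqrt(k) dividing N")
    m = N // q
    return (i // m, q - 1)


if __name__ == "__main__":
    N, k = 8, 16  # 4 x 4 grid, blocks of size 2 x 2
    print(block_owner(2, 5, N, k))  # entry in block row 1, block column 2
    print(rhs_owner(5, N, k))       # right hand side part in the last column
```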

We denote by $m$ the block-size of $A$, i.e. $m = N/\sqrt{k}$. Consider now the time it takes to
execute the Gaussian elimination algorithm with no pivoting, on such a grid. Each step of the
algorithm requires that we move the pivot row, i.e. row $j$, so that the processors under those
containing this row will be able to perform the eliminations. The $j$-th row, whose length is
$N-j+1$, consists of $\lceil (N-j+1)/m \rceil$ blocks of size $m$ each (except the first block, whose
length is $(N-j+1) \bmod m$), with each block in one processor. Each of these processors must
transfer its block downwards to all processors below it.

At the same time the processors containing the blocks of the $j$-th column must transfer the
pivots to all processors to their right. Since the transfers in the horizontal and vertical
directions can be overlapped, the partial transfer time $t_j$ at step $j$ will be the same as that of
transferring one block of $m$ elements from one processor to its $\lceil (N-j+1)/m \rceil$ immediate
neighbors, i.e.

$$t_j = \left(\left\lceil \frac{N-j+1}{m} \right\rceil + m\right)\tau_G.$$


In order to simplify the summation of the above partial times over $j = 1, \ldots, N-1$, we will
approximate the above time by

$$t_j \approx \left(\frac{N-j}{m} + m\right)\tau_G.$$

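Summing up over the steps $j = 1, \ldots, N-1$ and keeping the highest order terms, we find that
the total communication time is approximately

$$\left(\frac12 N\sqrt{k} + \frac{N^2}{\sqrt{k}}\right)\tau_G. \qquad (6.1)$$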

Once the transfers of the necessary parts of the $j$-th row and column are completed, each of
the active processors performs $m$ eliminations, at a total cost of $m^2\omega$ per step, where
$\omega$ is the time to perform one multiplication and one addition. Therefore, the total time for
performing the arithmetic operations is

$$t_{a,G} \approx N m^2 \omega = \frac{N^3}{k}\,\omega. \qquad (6.2)$$
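
As a sanity check on these estimates, the per-step transfer times can be summed exactly and compared with the leading-order closed form. The sketch below works in units of $\tau_G$ and $\omega$, assumes $\sqrt{k}$ divides $N$, and uses function names of our own choosing.

```python
import math


def block_size(N, k):
    """Block size m = N / sqrt(k); requires a perfect square k with sqrt(k) | N."""
    q = math.isqrt(k)
    if q * q != k or N % q != 0:
        raise ValueError("k must be a perfect square with sqrt(k) dividing N")
    return N // q


def exact_comm_units(N, k):
    """Sum of ceil((N - j + 1) / m) + m over j = 1..N-1, in units of tau_G."""
    m = block_size(N, k)
    # -(-x // m) is integer ceiling division.
    return sum(-(-(N - j + 1) // m) + m for j in range(1, N))


def approx_comm_units(N, k):
    """Leading-order estimate (1/2) N sqrt(k) + N^2 / sqrt(k), in units of tau_G."""
    return 0.5 * N * math.sqrt(k) + N**2 / math.sqrt(k)


def arith_units(N, k):
    """N steps of m^2 multiply-add pairs each, in units of omega."""
    m = block_size(N, k)
    return N * m * m


if __name__ == "__main__":
    N, k = 64, 16
    print(exact_comm_units(N, k), approx_comm_units(N, k), arith_units(N, k))
```

For N = 64 and k = 16 the exact and approximate communication counts differ by less than two percent, which is consistent with neglecting the low order terms.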

Adding (6.1) and (6.2) we find the approximate total run time

$$\frac{N^3}{k}\,\omega + \frac12 N\sqrt{k}\,\tau_G + \frac{N^2}{\sqrt{k}}\,\tau_G. \qquad (6.3)$$


Observe that the above total time is of the same form as (5.7). It is the sum of an increasing
and a decreasing function of $k$ and, in general, as $k$ varies from $k = 1$ to $k = N^2$, the
execution time of the algorithm starts by decreasing and then increases again. We say "in general"
because this may not be true for some particular values of $\omega$ and $\tau_G$. For example if
$\tau_G$ is zero, or if it is very small, then the above total time will always decrease when $k$
increases from 1 to $N^2$. For $k = N^2$ the time is of the form $O(N^2)$, which means that when $k$
is much larger than $N$, it may be the case that using fewer than the total number of processors
will be better than using all of them.
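
This behavior is easy to observe numerically. The sketch below evaluates the approximate total time (arithmetic term $N^3\omega/k$, latency term $\frac12 N\sqrt{k}\,\tau_G$, transfer term $N^2\tau_G/\sqrt{k}$) over all admissible grids $k = q^2$, $q = 1, \ldots, N$. The parameter values are arbitrary and chosen only for illustration.

```python
import math


def total_time(k, N, omega, tau_G):
    """Approximate total time: arithmetic + latency + transfer terms."""
    root_k = math.sqrt(k)
    return (N**3 / k) * omega + 0.5 * root_k * N * tau_G + (N**2 / root_k) * tau_G


def best_square_k(N, omega, tau_G):
    """Brute-force search for the best k among perfect squares q*q, 1 <= q <= N."""
    return min((q * q for q in range(1, N + 1)),
               key=lambda k: total_time(k, N, omega, tau_G))


if __name__ == "__main__":
    N = 100
    # With a nonzero transfer time the best grid is much smaller than N^2.
    print(best_square_k(N, omega=1.0, tau_G=1.0))
    # With tau_G = 0 the time decreases all the way to k = N^2.
    print(best_square_k(N, omega=1.0, tau_G=0.0))
```

With N = 100 and $\omega = \tau_G$, the search selects a 36 x 36 grid, far fewer than the $N^2 = 10^4$ processors that arithmetic alone would suggest.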

This raises the question of optimality: what is the number of processors $k_{opt}$ that achieves
the smallest possible execution time $t_{opt}$? Let us call $t(k)$ the approximate total execution
time provided by the right hand side of (6.3), i.e.

$$t(k) = \frac{N^3}{k}\,\omega + \frac12\sqrt{k}\,N\tau_G + \frac{N^2}{\sqrt{k}}\,\tau_G. \qquad (6.4)$$

Differentiating (6.4) with respect to $k$ leads to a third degree equation whose explicit solution
is rather complicated. Fortunately, we can find a nearly optimal solution by simply considering the
first two terms of $t(k)$ as defined by (6.4). For all $k$ we have:

$$t(k) \ge \frac{N^3}{k}\,\omega + \frac12\sqrt{k}\,N\tau_G = N\tau_G\left(\frac{N^2\rho}{k} + \frac{\sqrt{k}}{2}\right) \qquad (6.5)$$

where $\rho$ is the ratio of arithmetic time over transfer time, i.e. $\rho = \omega/\tau_G$. The
minimum of the right hand side of (6.5) is achieved for

$$k_* = (4N^2\rho)^{2/3}, \qquad (6.6)$$

which is valid when $1 \le k_* \le N^2$, i.e. when $\frac{1}{4N^2} \le \rho \le N/4$. Replacing (6.6)
in (6.5) yields the inequality

$$\forall k, \quad t(k) \ge t_* \equiv \frac34\,(4\rho)^{1/3}\,\tau_G\,N^{5/3}. \qquad (6.7)$$
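
For reference, (6.6) and the constant in (6.7) follow from elementary calculus. The short derivation below writes $g(k)$ for the first two terms of $t(k)$; the symbol $g$ is introduced here only for this purpose.

```latex
\begin{align*}
g(k) &= \frac{N^3}{k}\,\omega + \frac12\sqrt{k}\,N\tau_G, \\
g'(k) &= -\frac{N^3\omega}{k^2} + \frac{N\tau_G}{4\sqrt{k}} = 0
  \;\Longrightarrow\; k_*^{3/2} = 4N^2\rho, \qquad \rho = \omega/\tau_G, \\
g(k_*) &= \frac{N^{5/3}\omega}{(4\rho)^{2/3}} + \frac12\,(4\rho)^{1/3} N^{5/3}\tau_G
        = \left(\frac14 + \frac12\right)(4\rho)^{1/3}\,\tau_G\,N^{5/3}
        = \frac34\,(4\rho)^{1/3}\,\tau_G\,N^{5/3}.
\end{align*}
```

The stationary point is a minimum because $g$ tends to infinity both as $k \to 0$ and as $k \to \infty$.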
Let us now show that the time $t_*$ is nearly optimal. From the definition of $t(k)$ we have

$$t_{opt} \le t(k_*) = t_* + \frac{\tau_G N^{4/3}}{(4\rho)^{1/3}},$$

where we have used (6.6) and (6.4). By (6.7), $t_{opt} \ge t_*$ and hence

$$0 \le t_{opt} - t_* \le \frac{\tau_G N^{4/3}}{(4\rho)^{1/3}},$$

or

$$0 \le \frac{t_{opt} - t_*}{t_*} \le (4\rho^2)^{-1/3} N^{-1/3},$$

which means that $t_*$ is a good approximation to $t_{opt}$ when $N$ is large.
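
The quality of this approximation can be checked numerically. The sketch below assumes the three-term form of $t(k)$ from (6.4) and the closed forms $k_* = (4N^2\rho)^{2/3}$ and $t_* = \frac34 (4\rho)^{1/3}\tau_G N^{5/3}$. Here $t_{opt}$ is found by brute force over grids $k = q^2$, which is an assumption of this illustration rather than part of the analysis.

```python
import math


def total_time(k, N, omega, tau_G):
    """Approximate total time t(k): arithmetic + latency + transfer terms."""
    root_k = math.sqrt(k)
    return (N**3 / k) * omega + 0.5 * root_k * N * tau_G + (N**2 / root_k) * tau_G


def k_star(N, rho):
    """Minimizer of the first two terms of t(k)."""
    return (4.0 * N**2 * rho) ** (2.0 / 3.0)


def t_star(N, rho, tau_G):
    """Value of the first two terms of t(k) at k = k_star."""
    return 0.75 * (4.0 * rho) ** (1.0 / 3.0) * tau_G * N ** (5.0 / 3.0)


def relative_gap(N, omega, tau_G):
    """(t_opt - t_star) / t_star, with t_opt searched over k = q*q, q = 1..N."""
    rho = omega / tau_G
    t_opt = min(total_time(q * q, N, omega, tau_G) for q in range(1, N + 1))
    ts = t_star(N, rho, tau_G)
    return (t_opt - ts) / ts


if __name__ == "__main__":
    for N in (100, 1000, 10000):
        # The relative gap shrinks slowly, roughly like N^(-1/3).
        print(N, round(k_star(N, 1.0)), relative_gap(N, 1.0, 1.0))
```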

Note that by the similarity of (6.4) and (5.7), we have also proved the following result.

Theorem 6.1. The minimum time for performing the Gaussian elimination algorithm on a multiprocessor
grid with an arbitrarily large number of processors is asymptotically of the form $O(N^{5/3})$.

The above result is disappointing in that the optimal time $O(N^{5/3})$ can be regarded as a poor
improvement over the $O(N^2)$ time of the ring, considering the additional hardware complexity
involved. The explanation for this inefficiency lies in the fact that when $k$ increases, one must
send data to processors that become farther and farther apart from each other. The distance behaves
like $O(\sqrt{k})$ and appears as an increasing function in the communication time.

A hardware solution for avoiding the difficulty is to provide for a broadcast facility. Suppose
for example that each vertical path between the processors on the same vertical line is a bus that
allows for broadcasting and, similarly, each horizontal line is a broadcast bus. For such buses, the
time to send a vector of length $m$ from one processor to some other processors is of the form
$m\tau_B$, independent of the number of processors to which the data is sent. Thus, with this new
feature, the latency times are nonexistent and the communication time becomes

$$\frac{N^2}{\sqrt{k}}\,\tau_B,$$

i.e. it always decreases as $k$ increases.
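
The contrast between the two interconnections can be made concrete with a small numerical comparison. The sketch below uses the leading-order communication time derived in Section 6 for the nearest neighbor grid and the broadcast time $N^2\tau_B/\sqrt{k}$. The function names and the choice $\tau_G = \tau_B = 1$ are illustrative assumptions.

```python
import math


def grid_comm_time(k, N, tau_G):
    """Nearest-neighbor grid: latency term plus transfer term."""
    return (0.5 * N * math.sqrt(k) + N**2 / math.sqrt(k)) * tau_G


def bus_comm_time(k, N, tau_B):
    """Row and column broadcast buses: transfer term only, no latency term."""
    return (N**2 / math.sqrt(k)) * tau_B


if __name__ == "__main__":
    N = 100
    for q in (1, 10, 100):
        k = q * q
        # The grid time eventually grows with k; the bus time keeps decreasing.
        print(k, grid_comm_time(k, N, 1.0), bus_comm_time(k, N, 1.0))
```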

It may then appear that the remedy to the latency problem would be to have a broadcast
bus in the horizontal and vertical directions instead of local links. However, this is only true for
Gaussian elimination, which requires mainly broadcast type data transfers. Other algorithms may
require more local data interchange. For example, in the solution of triangular systems, some
algorithms require moving data packets of equal size simultaneously from a number of processors
to their immediate neighbors [5]. In those cases the bus must be time-shared by the processors,
which effectively reduces its actual speed. For such algorithms, a bus does not appear to be a good
choice.

In light of the above argument it appears that a combination of local links and more global
communication is a reasonable approach when the number of processors is large. This approach
has been taken in several parallel architecture projects, e.g. the Finite Element Machine [1] and
the CEDAR machine [6].

7.  Conclusion 

The analysis of the Gaussian Elimination algorithm proposed in this paper shows that
communication complexity can be a serious obstacle to speed-up in parallel computation. Considering
arithmetic alone one might believe that using $k = N^2$ processors would speed up Gaussian
elimination by an order of $N^2$, thus leading to an algorithm whose run time is linear in $N$. We
have shown that for a bus architecture, a ring or a grid array this is impossible. We believe that
our analysis extends to other array architectures as well but we anticipate more complicated results.

Gaussian elimination is a communication intensive algorithm and the main difficulty in
attempting to speed up the algorithm stems from the fact that increasing the number of processors
also increases the distance between the data in nearest-neighbor type architectures. Many numerical
methods, such as the solution of partial differential equations by finite difference/finite element
methods, can be formulated so as to require local communication only. For such problems, local link
architectures are perfectly suitable. However, for some communication intensive problems, such as
the solution of dense linear systems considered in this paper, nearest neighbor interconnections
may seriously impede performance.